Janus pro mix dataset by Mark-ZhouWX · Pull Request #13 · SamitHuang/mindone

Mark-ZhouWX · 2025-03-17T01:41:16Z

What does this PR do?

Add janus pro mix dataset

SamitHuang · 2025-07-21T03:49:48Z

/gemini

gemini-code-assist · 2025-07-21T03:49:59Z

It looks like you might have intended to run a command. The /gemini command requires a subcommand. For example, to request a code review, you can use /gemini review. To get a summary of the pull request, you can use /gemini summary.

If you need a list of all available commands, please use /gemini help.

SamitHuang · 2025-07-21T03:50:18Z

/gemini review

SamitHuang · 2025-07-21T03:51:35Z

/gemini summary

gemini-code-assist

Code Review

This pull request introduces a MixDataset to combine text, VQA, and text-to-image datasets for mixed-task training. The implementation has a few areas for improvement in terms of correctness, efficiency, and adherence to best practices. My review includes suggestions to fix the dataset order, use more efficient numpy operations, and clean up minor code style issues.

gemini-code-assist · 2025-07-21T03:51:59Z

+from janus.train.vqa_dataset import VqaDataset
+
+
+class  MixDataset:


There's an extra space between class and the class name. According to PEP 8, there should be a single space.

Suggested change

class MixDataset:

class MixDataset:

gemini-code-assist · 2025-07-21T03:51:59Z

+    def cumsum(sequence):
+        r, s = [], 0
+        for e in sequence:
+            l = len(e)
+            r.append(l + s)
+            s += l
+        return r


This manual implementation of cumsum can be simplified and made more efficient by using numpy.cumsum. Since numpy is already a dependency, this would be more idiomatic and concise. bisect works fine with NumPy arrays.

Suggested change

def cumsum(sequence):

r, s = [], 0

for e in sequence:

l = len(e)

r.append(l + s)

s += l

return r

def cumsum(sequence):

return np.cumsum([len(e) for e in sequence])

gemini-code-assist · 2025-07-21T03:51:59Z

+        self.default_image_shape = default_image_shape
+        self.max_token_length = max_token_length
+        self.datasets = datasets
+        self.num_dataset = len(datasets)


The instance variable self.num_dataset is assigned but never used. It should be removed to avoid confusion and keep the code clean.

gemini-code-assist · 2025-07-21T03:51:59Z

+        # add image and image_seq_mask item to pure text for batching
+        if dataset_idx == 0:
+            image = np.zeros(self.default_image_shape, np.float32)
+            image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)


The np.bool alias is deprecated since NumPy 1.20 and will be removed in a future version. You should use the standard Python bool type instead for forward compatibility.

Suggested change

image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)

image_seq_mask = np.zeros((self.max_token_length), dtype=bool)

gemini-code-assist · 2025-07-21T03:51:59Z

+    sample_weights = []
+    assert len(sample_ratios) == len(datasets)
+    for i in range(len(sample_ratios)):
+        weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i])
+        sample_weights += [weight] * len(datasets[i])


The loop for creating sample_weights can be written more concisely and efficiently using NumPy operations. Using np.repeat is more idiomatic for this kind of task and can be faster for large datasets.

Suggested change

sample_weights = []

assert len(sample_ratios) == len(datasets)

for i in range(len(sample_ratios)):

weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i])

sample_weights += [weight] * len(datasets[i])

assert len(sample_ratios) == len(datasets)

weights_per_dataset = np.array([

r * len(mix_dataset) / len(d) for r, d in zip(sample_ratios, datasets)

])

counts = [len(d) for d in datasets]

sample_weights = np.repeat(weights_per_dataset, counts).tolist()

gemini-code-assist · 2025-07-21T03:52:18Z

Summary of Changes

This pull request introduces a foundational component for multi-modal training within the Janus Pro project. It provides a robust mechanism to combine and sample from disparate datasets—pure text, VQA, and text-to-image—into a single, coherent data stream for MindSpore-based models. This enhancement is crucial for training models that require exposure to a wide variety of data types to improve their generalization and performance across different tasks.

Highlights

New Dataset Implementation: I've added a new MixDataset class that intelligently combines multiple types of datasets, specifically pure text, Visual Question Answering (VQA), and text-to-image data. This allows for a unified training pipeline across diverse data modalities.
Flexible Data Loading Utility: A create_mix_dataloader function has been introduced to simplify the configuration and loading of this mixed dataset. It supports various parameters like dataset paths, token lengths, image sizes, and crucially, allows for weighted sampling to control the proportion of samples drawn from each underlying dataset.
MindSpore Integration: The new data pipeline leverages MindSpore's dataset API, utilizing GeneratorDataset, DistributedSampler, and WeightedRandomSampler to ensure efficient and scalable data loading for training.

Changelog

examples/janus/janus/train/mix_dataset.py
- Added MixDataset class: This class acts as a wrapper around multiple datasets, providing a unified interface to iterate over them. It includes a cumsum static method to manage cumulative sizes for indexing and ensures consistent output by adding placeholder image/mask data for text-only samples.
- Implemented create_mix_dataloader function: This function orchestrates the creation of individual TextDataset, TextImageDataset, and VqaDataset instances, combines them into a MixDataset, and then wraps it with MindSpore's GeneratorDataset and appropriate samplers (DistributedSampler, WeightedRandomSampler) to form the final dataloader.
- Included an example usage block (if __name__ == "__main__":): This demonstrates how to initialize the VLChatProcessor and create a mixed dataloader, providing a quick way to test the new functionality.

Activity

The author, Mark-ZhouWX, opened this pull request to add the Janus Pro mix dataset.
SamitHuang initiated a /gemini command, followed by requests for a /gemini review and /gemini summary.
The gemini-code-assist[bot] provided several review comments, including a PEP 8 style suggestion for class definition spacing, a recommendation to use numpy.cumsum for efficiency, identification of an unused instance variable (self.num_dataset), a warning about the deprecated np.bool type, and a suggestion for a more idiomatic NumPy approach to calculate sample_weights using np.repeat.

Mark-ZhouWX added 3 commits March 14, 2025 16:10

add mixed dataset for janus pro

52ab340

fix some bugs

3e3504d

revise path

873fe2a

Mark-ZhouWX requested a review from SamitHuang as a code owner March 17, 2025 01:41

gemini-code-assist Bot reviewed Jul 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Janus pro mix dataset#13

Janus pro mix dataset#13
Mark-ZhouWX wants to merge 3 commits into
SamitHuang:janusfrom
Mark-ZhouWX:janus

Mark-ZhouWX commented Mar 17, 2025

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

gemini-code-assist Bot commented Jul 21, 2025

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Uh oh!

gemini-code-assist Bot commented Jul 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		from janus.train.vqa_dataset import VqaDataset


		class MixDataset:

	image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)
	image_seq_mask = np.zeros((self.max_token_length), dtype=bool)

Conversation

Mark-ZhouWX commented Mar 17, 2025

What does this PR do?

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

gemini-code-assist Bot commented Jul 21, 2025

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

SamitHuang commented Jul 21, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot commented Jul 21, 2025

Summary of Changes

Highlights

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants